Reading own writes using context objects in a distributed database

ABSTRACT

A context object is created when a write operation is initiated. The client application or user performs a write operation to a table and receives a context object that has information on all the tablets that are impacted by writes by the client application. The context object may contain a list describing which key ranges the client application has updated. As such, only the portion of the table that has been touched needs to be checked, and this typically includes only a small subset of the tablets associated with that table. This leads to a small verification cost because only the impacted tablets in the cluster are checked. The only portion of the table in the database that needs to be verified is the one or more portions that were updated and nothing else.

FIELD

The invention relates to data processing. More particularly, the invention relates to reading own writes using context objects in a distributed database.

BACKGROUND

As is known, client applications make frequent queries and updates to databases which, in many current computing environments, are distributed. A table in a database may be distributed over multiple nodes, e.g. 100s of nodes.

As is known in the field, a table may also have what is referred to as a secondary index. When a write is done to a table initiated by an application, the index for that table is also updated, typically to facilitate subsequent queries. In a distributed system, the index is generally updated asynchronously; this means that the write returns as soon as the table itself is updated, but before the index is confirmed to have been updated. Remote replicas of the table might be updated in a similar asynchronous fashion. Because an acknowledgement is sent to the client as soon as the table itself has been updated, but before the index and remote replicas are updated, it is possible for the client application to initiate a read to the index and remote replicas before the previous update has been applied to them.

The delay before this update is propagated to all indexes and remote replicas depends, naturally, on the size and complexity of the distributed computing system, as well as the extent to which the specific table is distributed over the system. Because the index is not updated immediately, it is possible that the user who initiated the write operation and performs a read immediately on the same table does not see the write to the table if they attempt to read the table by using the index. The user may see older data when using a secondary index that has not been updated before the read occurs, but see newer data if the table is accessed directly. This can make it difficult to write programs that function correctly. Likewise, the client application may cause some other application to issue a read that may or may not use the index or the main table. If that read occurs soon after the acknowledgment of the original write, inconsistent results may be observed by this other application as well.

In many applications, it has become increasingly important for a client application to see data that is consistent with the recent actions of the client application in the primary table and secondary index after the write is performed. No matter how quickly the write is executed, the client application must be able to guarantee that subsequent operations see the results of the write. This is presently not guaranteed or even verifiable with distributed databases. It would be preferable if the client application could verify that reads from the secondary index at any time are true and up-to-date relative to all writes done by that application.

It is not, however, acceptable to check each subdivision of a table and each subdivision of each index or remote replica to determine whether all consequences of a write have completed. As noted, a distributed table is spread out among multiple nodes. A node, for example, can have computing, storage, and storage access capabilities, for instance sixteen CPUs and either disk drives or solid state drives for storage. For example, one node can store ten terabytes of data. A table is comprised of multiple subdivisions, also referred to as tablets.

Typically, a table is comprised of hundreds of tablets. Data in a table is stored in these tablets, typically organized by the primary key of the table. The time it takes to verify that a corresponding index for a tablet is up-to-date is directly proportional to the number of tablets comprising the table. Each and every one of the subdivisions that has data belonging to the table needs to be checked to see if a corresponding index has been updated for this tablet. This is a significantly time-consuming task.

SUMMARY

In embodiments, a context object is created when a write operation is initiated. The client application or user performs a write operation to a table and receives a context object that has information on all the tablets that are impacted by writes by the client application.

In embodiments, the context object may contain a list describing which key ranges the client application has updated. As such, only the portion of the table that has been touched needs to be checked, and this typically includes only a small subset of the tablets associated with that table. This leads to a small verification cost because only the impacted tablets in the cluster are checked. The only portion of the table in the database that needs to be verified is the one or more portions that were updated and nothing else.

DRAWINGS

FIG. 1 is a block diagram showing a mechanism for reading own writes using context objects in a distributed database;

FIG. 2 is a flow diagram showing an overview of the process for reading own writes using context objects in a distributed database; and

FIG. 3 is a block schematic diagram showing a machine in the example form of a computer system within which a set of instructions for causing the machine to perform one or more of the methodologies discussed herein may be executed.

DESCRIPTION

In embodiments, a context object is created when a write operation is done. The client application or user performs a write operation to a table by dividing the write into subsidiary writes according to the allocation of key ranges to tablets in the table and then sending each part of the write operation to each tablet affected. From each tablet, the client application receives a context object that has information on the write operation for that tablet. Taken together, the information in the context objects returned from each tablet regarding the individual writes by the client application can be combined into a context object for the entire write operation. In many cases, of course, a write updates only a single row. In such a case, the context object for the write operation would be identical to the context object returned from the single affected tablet.

In embodiments, the context object may contain a list describing which key ranges the client application has updated. As such, only the portion of the table that has been touched needs to be checked, and this typically includes only a small subset of the tablets associated with that table. This leads to a small verification cost because only the impacted tablets in the cluster are checked. The only portion of the table in the database that needs to be verified is the one or more portions that were updated and nothing else.

The context object also contains timestamp data. The context object may contain a timestamp for the updates that is the latest timestamp of any of the updates to any tablet in the table. The context object may alternatively contain a timestamp for each key range that was updated.
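By way of illustration only, a context object carrying the key ranges and timestamp data described above might be represented as in the following Python sketch. The names TabletAck, ContextObject, and build_context, the integer timestamps, and the representation of a key range as a pair of string keys are assumptions made for this example and are not taken from any particular implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

# Assumption: a key range is a half-open (start, end) pair of string keys.
KeyRange = Tuple[str, str]


@dataclass(frozen=True)
class TabletAck:
    """Hypothetical reply from one tablet to one subsidiary write."""
    tablet_id: str
    key_range: KeyRange
    timestamp: int  # assigned by the tablet that applied the write


@dataclass
class ContextObject:
    """Hypothetical context object for an entire write operation."""
    tablet_ids: Set[str] = field(default_factory=set)
    range_timestamps: Dict[KeyRange, int] = field(default_factory=dict)

    def latest_timestamp(self) -> int:
        """Latest timestamp of any update recorded in this context."""
        return max(self.range_timestamps.values())


def build_context(acks: List[TabletAck]) -> ContextObject:
    """Combine the per-tablet replies into one context object."""
    ctx = ContextObject()
    for ack in acks:
        ctx.tablet_ids.add(ack.tablet_id)
        previous = ctx.range_timestamps.get(ack.key_range, ack.timestamp)
        ctx.range_timestamps[ack.key_range] = max(previous, ack.timestamp)
    return ctx


example_acks = [
    TabletAck("t1", ("a", "f"), 10),
    TabletAck("t7", ("m", "p"), 12),
]
example_ctx = build_context(example_acks)
print(sorted(example_ctx.tablet_ids), example_ctx.latest_timestamp())
```

In this sketch the per-range timestamps are always retained, and the single latest timestamp is derived from them on demand; an embodiment that keeps only the latest timestamp would store less data at the cost of a coarser test.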

Each write to a table consists of writes to one or more tablets. Each tablet responds to a write operation by returning a confirmation of the write along with the timestamp associated with each write. Asynchronously, each tablet writes records to any secondary indexes and maintains the minimum timestamp of any pending secondary index writes. Because of the monotonicity of timestamps within each tablet, it is possible to determine whether any particular write has propagated successfully to the index by testing this minimum timestamp against the timestamp for the write.
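The minimum-timestamp test described above can be sketched as follows. This is a simplified, single-process model written for illustration: the Tablet class, its method names, and the use of an integer counter as the tablet-local timestamp are assumptions, and the asynchronous index writer is simulated by explicit calls to complete_index_write.

```python
from typing import Optional, Set


class Tablet:
    """Hypothetical in-memory model of one tablet."""

    def __init__(self) -> None:
        self._clock = 0                # monotonic within this tablet
        self._pending: Set[int] = set()  # timestamps awaiting index writes

    def write(self, key: str, value: str) -> int:
        """Apply a write and return its timestamp as the confirmation."""
        self._clock += 1
        self._pending.add(self._clock)  # index write is still outstanding
        return self._clock

    def complete_index_write(self, timestamp: int) -> None:
        """Called when the asynchronous index write for a timestamp lands."""
        self._pending.discard(timestamp)

    def min_pending_timestamp(self) -> Optional[int]:
        """Minimum timestamp of any pending index write, or None."""
        return min(self._pending) if self._pending else None

    def is_indexed(self, write_timestamp: int) -> bool:
        """True once no pending index write is as old as the given write."""
        oldest = self.min_pending_timestamp()
        return oldest is None or oldest > write_timestamp


tablet = Tablet()
first = tablet.write("k1", "v1")
second = tablet.write("k2", "v2")
tablet.complete_index_write(first)
print(tablet.is_indexed(first), tablet.is_indexed(second))  # prints True False
```

The test is conservative: if a later index write finishes before an earlier one, the later write is still reported as not indexed until the earlier one completes, because only the minimum pending timestamp is consulted.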

When a client application reads data where the read is intended to be performed only after all indexes reflect the results of a previous write, the context object associated with that write can be used to test efficiently and possibly to wait for the completion of all indexing operations. At a minimum, the pending index timestamp for any tablets that are being queried can be compared to the timestamp or timestamps in the context object. If query ranges affected by the write are retained in the context object, then the number of tablets to be queried can be reduced to the intersection of those involved in the query and those involved in the write.
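As a minimal sketch of the reduction described above, the following Python functions restrict the check to the tablets that are both read by the query and recorded in the context object. The dictionary from tablet identifier to write timestamp, the dictionary of reported minimum pending timestamps, and the function names are hypothetical and are used only for this example.

```python
from typing import Dict, Optional, Set


def tablets_to_probe(query_tablets: Set[str],
                     written: Dict[str, int]) -> Set[str]:
    """Tablets involved in both the query and the earlier write."""
    return query_tablets & set(written)


def index_is_current(query_tablets: Set[str],
                     written: Dict[str, int],
                     min_pending: Dict[str, Optional[int]]) -> bool:
    """Compare each probed tablet's pending index timestamp to the write's.

    min_pending maps a tablet id to its minimum pending index timestamp,
    or to None when that tablet has no pending index writes.
    """
    for tablet_id in tablets_to_probe(query_tablets, written):
        pending = min_pending.get(tablet_id)
        if pending is not None and pending <= written[tablet_id]:
            return False  # the write may not have reached the index yet
    return True


written = {"t1": 10, "t7": 12}            # from the context object
query_tablets = {"t1", "t2", "t3"}        # tablets the read will touch
print(tablets_to_probe(query_tablets, written))  # prints {'t1'}
print(index_is_current(query_tablets, written, {"t1": 11, "t2": 3}))  # prints True
```

Only tablet t1 is probed here; the pending work on t2 is ignored because the client application never wrote to it.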

The tablets can maintain independent statistical estimates of the indexing delays that they are seeing. They can combine this information with the timestamps of pending indexing operations to get an optimistic estimate of the earliest time that there is a significant probability that a particular timestamp is cleared. This estimate can be returned to the client application and used as a time to delay until the next time that the client application queries the tablet to see if all writes of interest have been indexed.
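One simple way a tablet might form such an estimate is sketched below. The choice of an exponentially weighted moving average, the DelayEstimator class, and the parameter alpha are assumptions made for this example; the description above does not prescribe any particular statistical method, and the mean delay used here is only one possible reading of an optimistic estimate.

```python
from typing import Optional


class DelayEstimator:
    """Hypothetical per-tablet estimate of indexing delay, in seconds."""

    def __init__(self, alpha: float = 0.2) -> None:
        self.alpha = alpha                      # weight given to new samples
        self.mean_delay: Optional[float] = None

    def observe(self, delay: float) -> None:
        """Record the observed delay of one completed index write."""
        if self.mean_delay is None:
            self.mean_delay = delay
        else:
            self.mean_delay += self.alpha * (delay - self.mean_delay)

    def suggested_wait(self, enqueued_at: float, now: float) -> float:
        """Seconds the client might wait before probing the tablet again.

        enqueued_at is when the oldest relevant index write was queued.
        """
        if self.mean_delay is None:
            return 0.0  # no history yet, so probe again immediately
        return max(0.0, enqueued_at + self.mean_delay - now)


estimator = DelayEstimator(alpha=0.5)
estimator.observe(0.10)
estimator.observe(0.20)
print(round(estimator.suggested_wait(enqueued_at=100.0, now=100.05), 3))  # prints 0.1
```

A tablet would return the suggested wait along with its answer to the probe, and the client application would sleep for that long before asking again.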

In embodiments, the contents of a context object can be passed from one client application to another. The client application that receives the context object can perform the test for index currency exactly the way that the original client application could do.

In embodiments, the contents of a context object can be returned to a web browser in the form of a cookie. This cookie can be decoded back into a context object by a Web application server and then used to verify that indexing operations have completed relative to a particular write.
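A minimal sketch of such an encoding is shown below, using JSON and URL-safe Base64 from the Python standard library. The function names and the dictionary layout of the context object are assumptions made for this example. The sketch does not sign or encrypt the value; a deployment that must not trust the browser would need to add integrity protection.

```python
import base64
import json


def encode_context(context: dict) -> str:
    """Serialize a context object into a cookie-safe ASCII string."""
    raw = json.dumps(context, sort_keys=True, separators=(",", ":"))
    return base64.urlsafe_b64encode(raw.encode("utf-8")).decode("ascii")


def decode_context(cookie_value: str) -> dict:
    """Rebuild the context object from the cookie value."""
    raw = base64.urlsafe_b64decode(cookie_value.encode("ascii"))
    return json.loads(raw.decode("utf-8"))


# Hypothetical layout: tablet id -> write timestamp, plus touched key ranges.
context = {
    "tablets": {"t1": 10, "t7": 12},
    "ranges": [["a", "f"], ["m", "p"]],
}
cookie_value = encode_context(context)
print(decode_context(cookie_value) == context)  # prints True
```

The Web application server would call decode_context on a later request and then perform the same index currency test that the original writer could have performed.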

In embodiments, context objects can be combined so that the indexing status of multiple writes can be checked in a single test. The combination requires that the union of all affected tables be kept and that the maximum of any timestamps be kept.
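The combination rule can be sketched in a few lines of Python. For this example each context object is assumed to be a dictionary keyed by the identifier of an affected tablet, with the write timestamp as the value; that layout and the function name combine_contexts are hypothetical.

```python
from typing import Dict


def combine_contexts(*contexts: Dict[str, int]) -> Dict[str, int]:
    """Union of the affected entries, keeping the maximum timestamp of each."""
    combined: Dict[str, int] = {}
    for context in contexts:
        for entry_id, timestamp in context.items():
            combined[entry_id] = max(timestamp, combined.get(entry_id, timestamp))
    return combined


first_write = {"t1": 10, "t7": 12}
second_write = {"t1": 15, "t3": 4}
print(combine_contexts(first_write, second_write))
# prints {'t1': 15, 't7': 12, 't3': 4}
```

Because timestamps are monotonic within a tablet, confirming that the maximum timestamp for an entry has been indexed also confirms every earlier write to that entry.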

In this manner, the client application making the update to the table is guaranteed that an immediate subsequent read or write operation has a secondary index for the table that is true and up-to-date, reflecting the write that was just made by that user. Each subdivision or tablet in a node has its own internal structure for storing information on what data in the table has been changed and how much of that changed data has been synchronized with tables of the secondary index for that table.

FIG. 1 is a block diagram showing a mechanism for reading own writes using context objects in a distributed database.

In FIG. 1, a client application 101 makes one or more write requests 102 from any thread to tablets 103 that are associated with a table 104. The tablets 103 later perform index writes 105 to an index table 106 also composed of tablets 107. The timestamps of the original changes for pending index writes are maintained so that each tablet can determine whether a write has been propagated successfully by checking a timestamp.

The tablets return one or more context objects 107 as a result of the write request 102, typically long before the index writes 105 have completed. Subsequently, the client application 101 or any other program with a copy of the context object can probe the tables to determine if the changes due to the original write request 102 have been propagated to the tables.

FIG. 2 is a flow diagram showing an overview of the process for reading own writes using context objects in a distributed database.

In FIG. 2:

The client application or user performs a write operation to a table (200) and receives a context object which has information on all the tablets that are impacted by writes by the client application (210).

The context object contains a list describing what key ranges the client application has updated. The context object also contains timestamp data (220).

Each write to the table consists of writes to one or more tablets (230).

Each tablet responds to a write operation by returning a confirmation of the write along with the timestamp associated with each write (240).

Each tablet writes records to any secondary indexes asynchronously and maintains the minimum timestamp of any pending secondary index writes (250).

When a client application reads data, the context object associated with that write can be used to test efficiently and possibly to wait for the completion of all indexing operations (260).
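The final step, in which the client application tests and possibly waits, might look like the following polling loop. This is a sketch under stated assumptions: wait_until_indexed and its parameters are hypothetical names, the is_current callable stands in for whatever probe of the affected tablets an embodiment provides, and a fixed polling interval is used for simplicity.

```python
import time
from typing import Callable


def wait_until_indexed(is_current: Callable[[], bool],
                       timeout_s: float = 1.0,
                       poll_s: float = 0.01,
                       sleep: Callable[[float], None] = time.sleep,
                       clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll until the index reflects the write or the timeout expires.

    Returns True if the probe succeeded, False if the timeout was reached.
    """
    deadline = clock() + timeout_s
    while True:
        if is_current():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)


# Simulated probe that succeeds on the third attempt.
attempts = {"count": 0}


def probe() -> bool:
    attempts["count"] += 1
    return attempts["count"] >= 3


print(wait_until_indexed(probe, timeout_s=5.0, poll_s=0.0))  # prints True
```

A read that requires read-your-writes behavior would call wait_until_indexed before querying the secondary index and fall back to the primary table, or report an error, if the call returns False.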

As noted, a context object has a subdivision identifier and a record of what writes were made to that subdivision by a specific user. Context objects require very little overhead with respect to storage or processing yet provide a significant advantage to users who require a guarantee and verification that secondary indexes are updated with previous writes to a primary table.

A similar mechanism can be used to track non-local replicas of the table. In such a case, each replica of any tablet would have a minimum timestamp for any pending replication to the replica of the tablet. This timestamp could be used to determine whether any data written by a client application had been replicated to the replica table. Each tablet in the replica can also record the last applied timestamp of any replication from another tablet. This allows the use of a context object to probe a replica to find out if the data written as part of the write associated with the context object has been fully propagated to the replica table.
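The replica probe can be sketched with the last applied timestamp recorded by each replica tablet. The dictionary layouts and the function names below are assumptions made for this example, and a replica tablet that has reported nothing is treated as not yet caught up.

```python
from typing import Dict, List


def lagging_replica_tablets(written: Dict[str, int],
                            last_applied: Dict[str, int]) -> List[str]:
    """Tablets whose replica has not yet applied the recorded write.

    written maps tablet id -> write timestamp from the context object.
    last_applied maps tablet id -> last replicated timestamp at the replica.
    """
    return sorted(
        tablet_id
        for tablet_id, timestamp in written.items()
        if tablet_id not in last_applied or last_applied[tablet_id] < timestamp
    )


def is_fully_replicated(written: Dict[str, int],
                        last_applied: Dict[str, int]) -> bool:
    """True when every written tablet has been replicated far enough."""
    return not lagging_replica_tablets(written, last_applied)


written = {"t1": 10, "t7": 12}
replica_state = {"t1": 10, "t7": 11}
print(lagging_replica_tablets(written, replica_state))  # prints ['t7']
print(is_fully_replicated(written, replica_state))      # prints False
```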

Computer Implementation

FIG. 3 is a block diagram of a computer system that may be used to implement certain features of some of the embodiments of the invention. The computer system may be a server computer, a client computer, a personal computer (PC), a user device, a tablet PC, a laptop computer, a personal digital assistant (PDA), a cellular telephone, an iPhone, an iPad, a Blackberry, a processor, a telephone, a web appliance, a network router, switch or bridge, a console, a hand-held console, a (hand-held) gaming device, a music player, any portable, mobile, hand-held device, wearable device, or any machine capable of executing a set of instructions, sequential or otherwise, that specify actions to be taken by that machine.

The computing system 30 may include one or more central processing units (“processors”) 35, memory 31, input/output devices 34, e.g. keyboard and pointing devices, touch devices, display devices, storage devices 32, e.g. disk drives, and network adapters 33, e.g. network interfaces, that are connected to an interconnect 36.

In FIG. 3, the interconnect is illustrated as an abstraction that represents any one or more separate physical buses, point-to-point connections, or both connected by appropriate bridges, adapters, or controllers. The interconnect, therefore, may include, for example, a system bus, a peripheral component interconnect (PCI) bus or PCI-Express bus, a HyperTransport or industry standard architecture (ISA) bus, a small computer system interface (SCSI) bus, a universal serial bus (USB), IIC (I2C) bus, or an Institute of Electrical and Electronics Engineers (IEEE) standard 1394 bus, also referred to as Firewire.

The memory 31 and storage devices 32 are computer-readable storage media that may store instructions that implement at least portions of the various embodiments of the invention. In addition, the data structures and message structures may be stored or transmitted via a data transmission medium, e.g. a signal on a communications link. Various communications links may be used, e.g. the Internet, a local area network, a wide area network, or a point-to-point dial-up connection. Thus, computer readable media can include computer-readable storage media, e.g. non-transitory media, and computer-readable transmission media.

The instructions stored in memory 31 can be implemented as software and/or firmware to program one or more processors to carry out the actions described above. In some embodiments of the invention, such software or firmware may be initially provided to the computing system 30 by downloading it from a remote system through the computing system, e.g. via the network adapter 33.

The various embodiments of the invention introduced herein can be implemented by, for example, programmable circuitry, e.g. one or more microprocessors, programmed with software and/or firmware, entirely in special-purpose hardwired, i.e. non-programmable, circuitry, or in a combination of such forms. Special-purpose hardwired circuitry may be in the form of, for example, one or more ASICs, PLDs, FPGAs, etc.

Although the invention is described herein with reference to the preferred embodiment, one skilled in the art will readily appreciate that other applications may be substituted for those set forth herein without departing from the spirit and scope of the present invention. 

1. A method for reading own writes using context objects in a distributed database, comprising: a client application initiating a write operation to a table in said distributed database; when said write operation is initiated, creating a context object; and said client application receiving said context object, said context object comprising information on all tablets within said database that are impacted by said write operation by said client application.
 2. The method of claim 1, wherein said context object further comprises a list describing key ranges that said client application has updated.
 3. The method of claim 1, wherein said context object further comprises timestamp data.
 4. The method of claim 3, wherein said context object comprises a timestamp for updates resulting from said write operation comprising a latest timestamp of any update to any tablet in said table.
 5. The method of claim 3, wherein said context object comprises a timestamp for each key range that was updated by said write operation.
 6. The method of claim 1, further comprising: retaining query ranges affected by said write operation in said context object to reduce a number of tablets to be queried to an intersection of those tablets involved in a query and those tablets involved in said write operation, wherein the write operation to said table comprises writes to one or more tablets.
 7. The method of claim 1, further comprising: each tablet responding to a write operation by returning a confirmation of the write along with the timestamp associated with each write.
 8. The method of claim 1, further comprising: each tablet asynchronously writing records to one or more secondary indexes; and each tablet maintaining a minimum timestamp of any pending secondary index writes.
 9. The method of claim 8, further comprising: determining when a particular write has propagated successfully to said secondary indexes by testing said minimum timestamp against a timestamp for said write operation.
 10. The method of claim 1, further comprising: associating said context object with a write operation to test efficiently and, if necessary, to wait for completion of all indexing operations when said client application reads data, where said read operation is to be performed only after all indexes reflect results of said write operation.
 11. The method of claim 1, further comprising: comparing a pending index timestamp for any tablets that are being queried to a timestamp or timestamps in said context object.
 12. The method of claim 1, further comprising: maintaining independent statistical estimates of indexing delays in said tablets; combining said independent statistical estimates of indexing delays with timestamps of pending indexing operations to determine an optimistic estimate of an earliest time that there is a significant probability that a particular timestamp is cleared; returning said estimate to said client application; and said client application using said estimate as a time to delay until a next time that said client application queries said tablets to determine if all write operations of interest have been indexed.
 13. The method of claim 1, further comprising: passing the contents of a context object from one client application to another client application, wherein the client application receiving said context object performs a test for index currency in the same manner as that of the client application from which the context object was received.
 14. The method of claim 1, further comprising: returning the contents of a context object to a Web browser in the form of a cookie; decoding said cookie back into a context object by a Web application server; and using said context object to verify that indexing operations have completed relative to a particular write operation.
 15. The method of claim 1, further comprising: combining context objects to test indexing status of all of multiple writes in a single test.
 16. The method of claim 15, further comprising: for said combination, keeping a union of all affected tables and a maximum of any timestamps; wherein a client application making an update to said table is guaranteed that an immediate subsequent read or write operation has a secondary index for the table that is true and up-to-date, reflecting the write operation that was just made by that client application.
 17. The method of claim 16, further comprising: each tablet in a node having its own internal structure for storing information on what data in the table has been changed and how much of that changed data has been synchronized with tables of a secondary index for that table.
 18. The method of claim 1, further comprising: tracking non-local replicas of said table with each context object for each replica of any tablet by providing a minimum timestamp for any pending replication to the replica of the tablet; using said timestamp to determine whether any data written by a client application had been replicated to a replica table; each tablet in a replica recording a last applied timestamp of any replication from another tablet; and said context object probing a replica to determine if data written as part of the write operation associated with the context object has been fully propagated to the replica table.
 19. An apparatus for reading own writes using context objects in a distributed database, comprising: a client application configured to make one or more write requests to a plurality of tablets that comprise said distributed database and that are associated with a table; said plurality of tablets each configured to perform index writes to an index table; one or more context objects returned by said tablets in response to said write request before said index writes have completed; said one or more context objects including a plurality of corresponding timestamps, each timestamp associated with a corresponding one of said tablets, said timestamps configured to maintain original changes for pending index writes for each said tablet, wherein said timestamps for each said tablet are used to determine whether a write has been propagated successfully; and wherein said client application or any other program with a copy of the context object can probe said tables to determine if changes due to an original write request have been propagated to said tables.
 20. A method for reading own writes using context objects in a distributed database, comprising: a client application or user performing a write operation to a table in a database comprised of a plurality of tablets; said client application receiving a context object containing information for all tablets that are impacted by said write operation, said information comprising a list of key ranges of the client application that have been updated as a result of said write operation and corresponding timestamp data; each write operation to said table comprising writes to one or more of said tablets; each tablet responding to a write operation by returning a confirmation of the write operation along with a timestamp associated with each write operation; each tablet asynchronously writing records to any secondary indexes and maintaining a minimum timestamp of any pending secondary index writes; and using said context object to test efficiently and to determine completion of all indexing operations when a client application reads data. 